ci(disagg): fail before writing result file + surface real failure class#1591
ci(disagg): fail before writing result file + surface real failure class#1591arygupt wants to merge 1 commit into
Conversation
Two independent disagg-CI hardening fixes (both verified drop-in): A) benchmark_serving.py: move the request-failure-rate gate ABOVE the `if args.save_result:` block. Previously the result JSON was written (json.dump + pytorch sidecar) and THEN the gate raised SystemExit, so a zero-/sub-threshold run left a schema-valid JSON on disk. Downstream collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml) key on file EXISTENCE, not exit code, so a written-then-failed file looks successful. Gating first keeps disk state consistent with the exit code. Complements PR #1590's process_result zero-guard (that one is downstream; this stops the misleading file at the source). B) launch_mi355x-amds.sh cleanup_and_save_logs: after tailing the slurm .err, classify it and emit a single GitHub ::error:: annotation naming the failure class (recipe model-not-found / readiness-timeout / fp8_blockwise IntraNode combine / MoRI transport / segfault), with a generic fallback. Without this the Actions UI shows only "No benchmark result files found". Deterministic recipe errors are checked first so a config bug is never mislabeled as a transport flake. Guarded on GITHUB_ACTIONS. Held for a follow-up: bounded flake-only retry + node-quarantine in benchmark-multinode-tmpl.yml (needs the submit.sh node-allocation path traced to thread a --exclude, and offline classifier validation against canned .err fixtures). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 992b90f. Configure here.
| else | ||
| echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
Error annotation emits on successful runs with stderr
Medium Severity
The cleanup_and_save_logs EXIT trap fires on all exits, including successful ones. The new ::error:: annotation block only checks whether the .err file is non-empty (-s), not whether the job actually failed. Slurm .err files commonly contain content on success (Python deprecation warnings, CUDA init messages, library logs). On a successful run with any stderr output, this will either match a broad pattern like MoRI (line 84 — likely present in normal operational logs) and emit a false transport-flake annotation, or fall through and emit the misleading "Unclassified failure" annotation. This defeats the purpose of surfacing real failure classes by flooding the Actions UI with false positives.
Reviewed by Cursor Bugbot for commit 992b90f. Configure here.
| # Surface the real failure class in the Actions UI. Without this, a | ||
| # launch failure shows only the generic "No benchmark result files | ||
| # found" from benchmark-multinode-tmpl.yml. Order matters: check the | ||
| # deterministic recipe error (model-not-found, #1581) before the | ||
| # transport-flake patterns (#1584 MoRI/readiness) so a config bug is | ||
| # never mislabeled as a flake. | ||
| if [[ -n "${GITHUB_ACTIONS:-}" ]]; then | ||
| local sig="" | ||
| if grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then | ||
| sig="recipe-error: model not found (deterministic - check MODEL/MODEL_PATH, not MoRI)" | ||
| elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then | ||
| sig="transport-flake: readiness/warmup timeout (MoRI pd-disagg)" | ||
| elif grep -qiE "Fp8BlockwiseQuant.*IntraNode|dispatch_combine|combine.*IntraNode" "$err_file"; then | ||
| sig="config-error: MoRI fp8_blockwise combine needs IntraNode (disable TBO/SDMA on FP4 prefill, #1584)" | ||
| elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then | ||
| sig="transport-flake: MoRI KV-transport error" | ||
| elif grep -qiE "segfault|Segmentation fault|signal 11|core dumped|gpucore" "$err_file"; then | ||
| sig="transport-flake: server segfault / core dump" | ||
| fi | ||
| if [[ -n "$sig" ]]; then | ||
| echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::${sig} (see slurm .err artifact)" | ||
| else | ||
| echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above" | ||
| fi | ||
| fi |
There was a problem hiding this comment.
🔴 The new ::error:: annotation block in cleanup_and_save_logs is gated only on stderr being non-empty and GITHUB_ACTIONS being set — neither implies the job actually failed. Since the trap runs on EXIT unconditionally (line 101) and SLURM .err is essentially always non-empty on a successful MoRI run (NCCL/MoRI INFO banners, tqdm to stderr, Python warnings, srun output), every green run will emit a red ::error title=AMD disagg job ... failed:: banner — inverting the PR's stated goal. Fix: capture local rc=$? as the very first line of the function and gate the entire ::error:: emission on rc -ne 0 (leaving the diagnostic .err tail unconditional).
Extended reasoning...
What the bug is
cleanup_and_save_logs is registered as a generic EXIT trap at line 101 (trap cleanup_and_save_logs EXIT), so it runs on every script exit — both success (exit 0) and failure. The new annotation block (lines 76–94) is guarded by only two conditions:
[[ -s "$err_file" ]]— stderr file is non-empty (line 66 enclosing block)[[ -n "${GITHUB_ACTIONS:-}" ]]— running under Actions (line 76)
Neither indicates that the job actually failed. There is no $? capture and no exit-code gate.
Why a successful run will hit it
SGLang/vLLM multinode jobs with MoRI + NCCL + SLURM essentially always produce non-empty stderr on success:
tqdmwrites progress bars to stderr by default- NCCL/RCCL prints INFO banners to stderr
warnings.warn(...)writes to stderr- SGLang/vLLM server INFO logs often route to stderr
srun --unbufferedanddocker stop $(docker ps -a -q) || true(job.slurm) leak banners and "no such container" messages to stderr — the trailing|| trueswallows the exit code but not the output- The submit.sh wrapper sets
--error=slurm_job-%j.err, so all of the above lands in the file the trap inspects
Two false-positive paths on green runs
- Pattern match (line 84):
grep -qiE "MoRI|mori_conn|pd[- ]?disagg"matches case-insensitively against the literal stringMoRIanywhere in stderr. For a MoRI-based benchmark this is effectively guaranteed (framework name appears in INFO/log lines and startup banners). Result:::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error. - Else branch (line 92): Even when no class matches, any non-empty unmatched stderr emits
::error title=AMD disagg job <id> failed::Unclassified failure - see last 100 lines of slurm .err above.
Either path renders a red banner in the Actions UI claiming the job FAILED on a run whose exit code is 0 and whose result JSON was written normally.
Step-by-step proof
- Script runs,
trap cleanup_and_save_logs EXITis set (line 101). - Multinode benchmark completes successfully, exit code 0.
cleanup_and_save_logsfires on EXIT.err_file=$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.erris non-empty (contains NCCL/MoRI INFO banners + tqdm + srun stderr).[[ -s "$err_file" ]]→ true → enter the block.[[ -n "${GITHUB_ACTIONS:-}" ]]→ true (we are in CI) → enter annotation block.- First three patterns (
Model not found,ReadTimeout,Fp8BlockwiseQuant) don't match a clean run. - Fourth elif:
grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"→ matches the literal "MoRI" in any startup log line →sig="transport-flake: MoRI KV-transport error". - Line 90 emits
::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error (see slurm .err artifact). - Actions UI shows a red error banner on a successful, exit-0 run.
Existing code already acknowledges this
The pre-existing if [[ -s "$err_file" ]] guard at line 66 plus the unconditional tail -100 block already implicitly acknowledges that .err is routinely non-empty on non-failure runs (otherwise the tail print wouldn't be conditional). What the new block does is escalate that benign non-emptiness into a misleading ::error:: annotation.
Impact
The annotation directly defeats the PR's stated goal ("surface real failure class", "fail loudly and legibly"). After a few green-but-red runs, reviewers learn to ignore the annotation entirely, and the next real failure looks identical to the prior false positives.
Fix
Capture the trap-entry exit code as the very first line of cleanup_and_save_logs and gate only the ::error:: emission on it. The diagnostic .err tail (lines 67–69) is fine to leave unconditional since it's only printed output, not an Actions annotation:
cleanup_and_save_logs() {
local rc=$? # MUST be first statement
# ... existing cp/tail logic unchanged ...
if [[ $rc -ne 0 && -n "${GITHUB_ACTIONS:-}" && -s "$err_file" ]]; then
# existing classifier + ::error:: emission
fi
}

What
Two independent, verified disagg-CI hardening fixes that make AMD multinode failures fail loudly and legibly instead of silently producing misleading artifacts.
A —
benchmark_serving.py: gate before writeThe request-failure-rate gate ran after the result JSON was written, so a zero-/sub-threshold run left a schema-valid JSON on disk and then exited non-zero. Downstream collectors (
launch_mi355x-amds.sh,benchmark-multinode-tmpl.yml) key on file existence, not exit code — so a written-then-failed file looked successful. Moved the gate aboveif args.save_result:so disk state stays consistent with the exit code. Complements #1590 (that guard is downstream inprocess_result; this stops the misleading file at the source).B —
launch_mi355x-amds.sh: emit::error::with the real failure classcleanup_and_save_logstailed the slurm.errbut emitted no GitHub annotation, so the Actions UI showed only the generic "No benchmark result files found". Now it classifies the.errand emits one::error::naming the class — recipe model-not-found (#1581) / readiness-timeout / fp8_blockwise IntraNode combine (#1584) / MoRI transport / segfault — deterministic recipe errors checked first so a config bug is never mislabeled as a flake. Guarded onGITHUB_ACTIONS.Held for follow-up
Bounded flake-only retry + node-quarantine in
benchmark-multinode-tmpl.yml. Needs thesubmit.shnode-allocation path traced to thread a--exclude, plus offline classifier validation against canned.errfixtures — not shipping a half-verified control-flow change to production CI.Test
python3 -m py_compilepasses; gate logic unchanged, only reordered (no re-indent). Unit-testable by forcingcompleted=0/completed<95%and asserting no JSON is written + exit≠0.bash -npasses; pure observability inside the existing EXIT trap, runs only underGITHUB_ACTIONS.🤖 Generated with Claude Code
Note
Low Risk
CI-only ordering and observability changes; benchmark gate logic is unchanged, and launcher edits only add stderr pattern matching for Actions annotations.
Overview
Hardens AMD disaggregated multinode CI so failed runs are easier to diagnose and no longer look successful downstream.
In
benchmark_serving.py, the 5% request failure-rate check now runs before any result JSON is written. Collectors treat a result file on disk as success even when the process exits non-zero; moving the gate avoids leaving a valid JSON artifact on sub-threshold or zero-completion runs.In
launch_mi355x-amds.sh,cleanup_and_save_logsstill tails the Slurm.errlog but now emits a GitHub::error::annotation (whenGITHUB_ACTIONSis set) that classifies the failure—recipe/model-not-found and MoRI fp8_blockwise config issues are matched before transport-flake patterns so deterministic misconfigurations are not labeled as flakes.Reviewed by Cursor Bugbot for commit 992b90f. Bugbot is set up for automated code reviews on this repo. Configure here.